# Dataset Card for ClinicalBench

ClinicalBench: An End-to-End, Real-Case-based, Data-Leakage-Free Benchmark for Multi-Department Clinical Diagnostic Evaluation

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Instances](#data-instances)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Data Sources & Licenses](#data-sources--licenses)
  - [Data Processing & Quality](#data-processing--quality)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Discussion of Biases](#discussion-of-biases)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
    

## Dataset Description

- **Homepage:** https://github.com/WeixiangYAN/ClinicalLab
- **Repository:** https://github.com/WeixiangYAN/ClinicalLab
- **Paper:** https://openreview.net/attachment?id=pSAyeAqPNo&name=pdf
- **Point of Contact:** Weixiang Yan (yanweixiang.ywx@gmail.com)

### Dataset Summary

ClinicalBench is a fine-grained evaluation benchmark designed for multi-departmental clinical diagnosis, covering 24 departments such as pediatrics, orthopedics, and neurosurgery. It covers 150 diseases with 10 cases each, for a total of 1,500 samples averaging about 1,000 tokens per case. Each case in ClinicalBench contains detailed clinical data, such as the patient's gender, age, chief complaint, medical history, and physical examination. The cases also include medical imaging reports, such as X-rays, computed tomography (CT) scans, magnetic resonance imaging (MRI), and ultrasound examinations, as well as biochemical, immunological, microbiological, and pathological laboratory examination results from biological samples such as blood, urine, and cerebrospinal fluid.

### Supported Tasks

ClinicalBench systematically evaluates the end-to-end practicality of LLMs in clinical diagnosis by simulating the complete patient visit process, from the patient's entry into the hospital to their discharge. We divide the entire process into 8 specific tasks, covering the stages from preliminary reception to final diagnosis and treatment plan formulation. These tasks are: Department Guide, Preliminary Diagnosis, Diagnostic Basis, Differential Diagnosis, Final Diagnosis, Principle of Treatment, Treatment Plan, and Imaging Diagnosis. Detailed task descriptions can be found in Section 3.4 of the paper.
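
As a rough orientation, the eight tasks can be paired with the reference fields listed under Data Fields. The pairing below is an assumption inferred from the field names only; it is not taken from the paper or from the evaluation code, and `TASK_TO_REFERENCE_FIELD` and `reference_for` are hypothetical names introduced for illustration.

```python
# Hypothetical task-to-field pairing, inferred from field names only.
# Check Section 3.4 of the paper before relying on it.
TASK_TO_REFERENCE_FIELD = {
    "Department Guide": "clinical_department",
    "Preliminary Diagnosis": "preliminary_diagnosis",
    "Diagnostic Basis": "diagnostic_basis",
    "Differential Diagnosis": "differential_diagnosis",
    "Final Diagnosis": "principal_diagnosis",
    "Principle of Treatment": "therapeutic_principle",
    "Treatment Plan": "treatment_plan",
    "Imaging Diagnosis": "imageological_examination",
}


def reference_for(task: str, record: dict):
    """Return the annotation assumed to serve as the reference for `task`."""
    return record[TASK_TO_REFERENCE_FIELD[task]]
```

For example, `reference_for("Final Diagnosis", record)` would return the value of `principal_diagnosis` for that record under this assumed pairing.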

### Languages

Chinese and English

## Dataset Structure

The ClinicalBench dataset is distributed through an application process. After users agree to the **ClinicalBench Usage and Data Distribution License Agreement** and submit an application, we will send the dataset to the email address they provide within 48 hours. You can apply for access to the ClinicalBench dataset at the following URL: https://forms.gle/Tkq5UTinW7bBB6388

### Data Fields

Data samples are arranged as follows:

- `id`: the counting sequence number of the data sample.
- `clinical_case_uid`: the identification number of the data sample.
- `language`: the language of the data sample, either Chinese (`zh`) or English (`en`).
- `clinical_department`: the name of the department corresponding to the data sample.
- `principal_diagnosis`: the final diagnosis results annotated by human doctors.
- `preliminary_diagnosis`: the preliminary diagnosis results annotated by human doctors.
- `diagnostic_basis`: the diagnostic basis annotated by human doctors for the preliminary diagnosis results.
- `differential_diagnosis`: other diseases that could explain the patient's current symptoms, as listed by human doctors, with a brief explanation of why each is excluded.
- `treatment_plan`: the treatment plans annotated by human doctors.
- `clinical_case_summary`: the patient's chief complaint, physical examination, medical history, auxiliary examinations and other information.
- `imageological_examination`: the necessary imaging examinations prescribed by human doctors for the patient. In the example instance below, this is an object keyed by examination name, and each examination contains:
  - `findings`: a detailed natural language description written by a human doctor based on the image.
  - `impression`: the diagnosis results written by human doctors based on the findings.
- `laboratory_examination`: the necessary laboratory examinations prescribed by human doctors for the patient. In the example instance below, this is an object keyed by test name, and each test contains:
  - `result`: the full laboratory test results as text, including reference ranges.
  - `abnormal`: the items from the results that fall outside their reference ranges.
- `pathological_examination`: the pathological examination results of the patients corresponding to the data samples.
- `therapeutic_principle`: the treatment principles annotated by human doctors.
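
The field list above can be turned into a quick completeness check. This is a minimal sketch that assumes each sample has already been parsed into a Python `dict` (for instance from a JSON file; the distribution format is not stated here, so that is an assumption). `EXPECTED_FIELDS` and `missing_fields` are hypothetical names, not part of any released tooling.

```python
# Top-level field names as documented in the Data Fields list.
EXPECTED_FIELDS = (
    "id",
    "clinical_case_uid",
    "language",
    "clinical_department",
    "principal_diagnosis",
    "preliminary_diagnosis",
    "diagnostic_basis",
    "differential_diagnosis",
    "treatment_plan",
    "clinical_case_summary",
    "imageological_examination",
    "laboratory_examination",
    "pathological_examination",
    "therapeutic_principle",
)


def missing_fields(record: dict) -> list[str]:
    """Return the documented top-level fields that are absent from `record`."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# A record with only an `id` is missing the other 13 documented fields.
print(len(missing_fields({"id": 1})))
```

Running such a check once after receiving the data is a cheap way to catch truncated or partially parsed records before evaluation.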

### Data Instances

Here's an example:

```
{
    "id": 1,
    "clinical_case_uid": "2e4ff11eaa244c2d8124e537d2b061e3",
    "language": "en",
    "clinical_department": "Gastroenterology",
    "principal_diagnosis": "Rupture and bleeding of esophagogastric varices",
    "preliminary_diagnosis": "1. Upper gastrointestinal bleeding; 2. Ruptured esophagogastric varices bleeding; 3. Liver cirrhosis; 4. Anemia; 5. Electrolyte imbalance.",
    "diagnostic_basis": "1. History of chronic Hepatitis B, and vomiting blood for 2 days after eating hard food.\n2. Physical examination supports the diagnosis: (1) Flat abdomen, no gastrointestinal peristaltic waves, abdominal breathing present, no visible abdominal wall vein varicosity. (2) Soft abdomen, no fluid wave or shifting dullness, no palpable masses, no significant tenderness or rebound pain, liver and spleen not palpable below the ribs, Murphy's sign negative, no evident renal tenderness or percussion pain, no abnormal vascular pulsation in the abdomen. (3) No significant tenderness at bilateral ureteral pressure points. Liver dullness present, upper boundary at the right mid-clavicular line at the fifth intercostal space, no shifting dullness. (4) Normal bowel sounds.\n3. Imaging examinations support the diagnosis: (1) CT Scan (Plain + Contrast) showing liver cirrhosis, splenomegaly, varices at the lower end of the esophagus and the gastric fundus, and varices in front of the spleen. (2) MRI (Plain + Contrast) indicating liver cirrhosis, fibrosis, enlarged spleen, portal hypertension. (3) Endoscopy (Esophagus, Stomach, Duodenum) revealing ruptured esophageal varices and portal hypertensive gastropathy.\n4. Laboratory examinations support the diagnosis: (1) Routine Blood Test shows: decreased red blood cells (RBC), decreased hemoglobin (HGB), and decreased hematocrit (HCT). (2) Blood biochemistry Test shows: increased aspartate aminotransferase (AST), decreased total protein (TP), decreased albumin (ALB), decreased albumin/globulin ratio (A/G), increased total bilirubin (TBIL), increased direct bilirubin (DBIL), increased indirect bilirubin (IBIL), decreased prealbumin (PA), decreased calcium (Ca), decreased sodium (Na), and decreased osmolarity (OSM). (3) Coagulation Function Test shows: increased prothrombin time (PT), increased thrombin time# (TT#), decreased fibrinogen# (Fg#), decreased percent activity (PT%), and increased International Normalized Ratio (PT.INR). ",
    "differential_diagnosis": "1. Gastric and Duodenal Ulcer with Bleeding: Bleeding is a common complication of ulcer disease. Minor bleeding often presents with no clinical symptoms and is only detected during fecal occult blood tests. A bleed greater than 500ml is considered severe, primarily manifested as vomiting blood, bloody stools, and varying degrees of anemia. In patients with a history of ulcer disease presenting with significant gastrointestinal bleeding, gastric and duodenal ulcers should be the first consideration.\n2. Mallory-Weiss Tear: This condition involves a longitudinal mucosal tear at the gastroesophageal junction or cardia leading to upper gastrointestinal bleeding, with 85% of patients presenting with symptoms of vomiting blood. The typical presentation occurs after an episode of nausea or vomiting. Gastroscopy can diagnose this condition by identifying active bleeding, adherent blood clots, or a fibrin crust on or near the mucosal tear at the gastroesophageal junction.\n3. Gastrointestinal Tumor with Bleeding: About 5% of cases may experience significant bleeding, presenting as vomiting blood and/or melena (black stools). It is commonly seen in individuals over 40 years old, especially males, who have recently experienced poor general condition, abdominal pain, or other gastrointestinal symptoms. Patients with a personal or family history of gastrointestinal tumors should be particularly considered.",
    "treatment_plan": "1. Based on the patient's condition, establish intravenous access, withhold food and water, and monitor vital signs.\n2. For treatment, administer intravenous infusion of omeprazole and somatostatin to stop bleeding and protect the stomach from acid; ceftriaxone to prevent infection, and magnesium isoglycyrrhizinate to improve liver function abnormalities; regularly monitor complete blood count, and perform blood transfusion treatment when necessary; provide fluid replacement to maintain stability of electrolytes and acid-base balance, as well as nutritional support and other symptomatic treatments.\n3. Complete routine admission tests such as electrocardiograms and cardiac echocardiography, determine surgical indications, rule out contraindications for surgery, and then schedule endoscopic surgery when appropriate.",
    "clinical_case_summary": "Case Summary\nPatient Basic Information: Middle-aged male, XX years old. (We anonymize the age information in the sample data presented.)\nChief Complaint: Vomiting blood for 2 days after eating.\nMedical History: The patient experienced vomiting of coffee-colored gastric contents (approximately 100ml) accompanied by dizziness, palpitations, and weakness after consuming hard food 2 days ago. There was no abdominal distension, pain, melena, or bloody stool, nor any confusion. The patient was treated conservatively with acid-suppressing and hemostatic medications, after which symptoms of vomiting blood improved. The patient has a history of chronic Hepatitis B for three years, which has not been treated. \nPhysical Examination: Pale skin and mucous membranes, flat abdomen with no visible peristaltic waves and presence of abdominal breathing. No abdominal wall vein varicosity was observed. The abdomen was soft without fluid wave or shifting dullness, and no palpable masses. There was no significant tenderness or rebound tenderness, and the liver and spleen were not palpable below the ribs. Murphy's sign was negative. No evident kidney area tenderness or percussion pain, and no abnormal vascular pulsation in the abdomen. No significant tenderness at bilateral ureteral pressure points. Liver dullness was present, with the upper boundary at the right mid-clavicular line at the fifth intercostal space, with no shifting dullness. Bowel sounds were normal. \nAuxiliary Examination:\n(1) Imaging Examination:\nCT Scan (Plain + Contrast): 1. Ground glass nodule in the lower lobe of the right lung, suggest a follow-up CT in 3-6 months. 2. Linear opacities in the lower lobes of both lungs. 3. Liver cirrhosis, splenomegaly, varices at the lower end of the esophagus and the gastric fundus, and varices in front of the spleen. 4. Possible subcapsular hemangioma in liver segment S7, further examination with MRI suggested. 5. Multiple small cysts in the right lobe of the liver. 6. Fluid accumulation in the gallbladder fossa. 7. No apparent abnormalities in the lower abdominal CT scan.\nMRI (Plain + Contrast): 1. Liver cirrhosis, fibrosis; enlarged spleen; portal hypertension. 2. Small cyst in liver segment S5. 3. Minor fluid accumulation in the gallbladder fossa.\nEsophagogastroduodenoscopy: 1. Esophageal varices rupture treated with banding and tissue glue sclerotherapy. 2. Esophageal drug injection via gastroscopy. 3. Endoscopic hemostasis. 4. Portal hypertensive gastropathy.\n(2) Laboratory Examination:\nRoutine Blood Test: 1. Red Blood Cells (RBC) 3.0*10^12/L ↓; 2. Hemoglobin (HGB) 97g/L ↓; 3. Hematocrit (HCT) 27.9% ↓; 4. Platelet Count (Impedance Method) (PLT-I) 47*10^9/L ↓; 5. Mean Platelet Volume (MPV) 13.2fL ↑; 6. Plateletcrit (PCT) 0.06% ↓.\nBlood Biochemistry Test: 1. Aspartate Aminotransferase (AST) 60U/L ↑; 2. Total Protein (TP) 61.6g/L ↓; 3. Albumin (ALB) 31.7g/L ↓; 4. Albumin/Globulin Ratio (A/G) 1.1 (reference range: 1.2-2.4) ↓; 5. Total Bilirubin (TBIL) 41.5μmol/L ↑; 6. Direct Bilirubin (DBIL) 10.0μmol/L ↑; 7. Indirect Bilirubin (IBIL) 31.5μmol/L ↑; 8. Prealbumin (PA) 93.5mg/L ↓; 9. Calcium (Ca) 2.10mmol/L ↓; 10. Sodium (Na) 136mmol/L ↓; 11. Osmotic Pressure (OSM) 272mOsm/kg ↓.\nCoagulation Function Test: 1. Prothrombin Time# (PT#) 20.8S ↑; 2. Thrombin Time# (TT#) 19.5S ↑; 3. Fibrinogen# (Fg#) 1.1g/L ↓; 4. Percentage Activity (PT%) 43% ↓; 5. International Normalized Ratio (PT.INR) 1.81 (reference range: 0.85-1.25) ↑.\nTumor Marker Test: 1. Alpha-Fetoprotein (AFP) 307.2ng/mL ↑; 2. Carbohydrate Antigen 19-9 (CA19-9) 69.9U/mL ↑.\n(3) Pathological Examination: None at the moment.",
    "imageological_examination": {
      "plain_computed_tomography_scan+contrast_computed_tomography_scan": {
        "findings": "(1) Lungs: There is a ground-glass nodule in the dorsal segment of the right lower lobe, approximately 5mm x 4mm in size. There are strip-like high-density shadows in both lower lobes. (2) Mediastinum: The structures of both hilum are normal; trachea and bronchi are patent. No significantly enlarged lymph nodes seen in the mediastinum. The heart is normal in size, shape, and position. No pleural thickening on both sides. Dilated and tortuous vessels are visible at the lower end of the esophagus and the fundus of the stomach. (3) Liver: Increased volume of the left hepatic lobe with uneven parenchymal density and irregular liver margins. A small patchy slightly hyperdense shadow is seen subcapsularly in liver segment S7, about 1.1cm in diameter, showing progressive enhancement post-contrast. Multiple small round hypo-dense shadows are seen in the right lobe, the largest being about 0.7cm in diameter, with no enhancement post-contrast. No dilatation of the intrahepatic and extrahepatic bile ducts.  (4) Gallbladder: Normal size, no wall thickening, no abnormal density within the lumen, fluid seen in the gallbladder fossa. (5) Spleen: Enlarged spleen with no obvious abnormal enhancement, multiple dilated and tortuous vascular shadows anterior to the hilum. (6) Pancreas: Clear outline, normal shape and size, no abnormal density or pancreatic duct dilation. (6) Adrenal Area:  No significant abnormalities. (7) Kidneys: Symmetrical kidneys, normal in shape and size, no abnormal density. (8) Abdomen and Pelvis: No enlarged lymph nodes in the abdominal cavity and retroperitoneal space. Normal prostate morphology and size, no abnormalities within. Normal seminal vesicle glands in size, shape, and density. The bladder is well-filled, with no wall thickening, and no abnormal density within. No enlarged lymph nodes in both pelvic walls and inguinal areas.",
        "impression": "(1) Ground-glass nodule in the dorsal segment of the right lower lobe, recommend follow-up CT in 3-6 months. (2) Strip-like densities in both lower lobes. (3) Cirrhosis, splenomegaly, esophageal and gastric fundal varices, varices anterior to the spleen hilum. (4) Possible small hemangioma subcapsularly in liver segment S7, recommend further examination with MRI. (5) Multiple small cysts in the right lobe of the liver. (6) Fluid in the gallbladder fossa. (7) No significant abnormalities in the lower abdominal CT scan. "
      },
      "plain_magnetic_resonance_imaging_scan+contrast_magnetic_resonance_imaging_scan": {
        "findings": "(1) Liver: Not large in volume, with diffuse distribution of thin, reticular high signal T2 fat-suppressed strands; small round high signal T2 lesion in liver segment S5, about 6mm in diameter. Gallbladder is small, with no significant abnormal signal within; a small amount of liquid signal in the gallbladder fossa. (2) Spleen: Significantly enlarged, with uniform signal. (3) Pancreas and Kidneys: Regular shape, uniform signal. (4) Adjacent to the gastroesophageal junction and gastric fundus: Twisted small vascular shadows. Portal vein and splenic vein are thickened.",
        "impression": "(1) Cirrhosis, fibrosis. (2) Splenomegaly. (3) Portal hypertension.(4) Small cyst in liver segment S5. (5) Small amount of fluid in the gallbladder fossa."
      },
      "esophagogastroduodenoscopy": {
        "findings": "The passage through the esophagus was smooth, with moderate varices in the lower segment appearing beaded and exhibiting positive red signs. Five rings of esophageal variceal ligation were performed using a variceal banding device. The gastroesophageal junction was well-functioning and patent, with esophageal varices extending to the fundus of the stomach, where cluster-like varices were visible. Sandwich method applied: two sites injected with 10 ml of polidocanol each and 3 ml of tissue adhesive (6 vials each). The gastric body mucosa was inflamed and eroded. The mucosa of the gastric antrum was congested and edematous, primarily red with interspersed white, showing scattered small patches of erosion. The pylorus was round and well-functioning; no obvious abnormalities were observed in the duodenal bulb.",
        "impression": " (1) Esophageal variceal rupture with banding and tissue adhesive sclerotherapy. (2) Esophagogastroscopic medication injection. (3) Endoscopic hemostasis. (4) Portal hypertensive gastropathy."
      }
    },
    "laboratory_examination": {
      "routine_blood_test": {
        "result": "1 White Blood Cells WBC 5.9 *10^9/L 3.5-9.5 ;2 Lymphocytes Percentage LYMPH% 40.7 % 20.0-50.0 ;3 Monocytes Percentage MONO% 7.5 % 3.0-10.0 ;4 Neutrophils Percentage NEUT% 49.8 % 40.0-75.0 ;5 Absolute Lymphocyte Count LYMPH# 2.4 *10^9/L 1.1-3.2 ;6 Absolute Monocyte Count MONO# 0.44 *10^9/L 0.10-0.60 ;7 Absolute Neutrophil Count NEUT# 2.9 *10^9/L 1.8-6.3 ;8 Red Blood Cells RBC 3.0 ↓ *10^12/L 4.3-5.8 ;9 Hemoglobin HGB 97 ↓ g/L 130-175 ;10 Hematocrit HCT 27.9 ↓ % 40.0-50.0 ;11 Mean Corpuscular Volume MCV 92 fL 82-100 ;12 Mean Corpuscular Hemoglobin MCH 32 pg 27-34 ;13 Mean Corpuscular Hemoglobin Concentration MCHC 345 g/L 316-354 ;14 Red Cell Distribution Width (CV) RDW-CV 12.9 % <15.0 ;15 Platelet Count (Impedance Method) PLT-I 47 ↓ *10^9/L 125-350 ;16 Mean Platelet Volume MPV 13.2 ↑ fL 8.0-10.0 ;17 Platelet Distribution Width PDW 16.7 fL 9.0-17.0 ;18 Eosinophils Percentage EO% 1.7 % 0.4-8.0 ;19 Basophils Percentage BASO% 0.3 % 0.0-1.0 ;20 Absolute Eosinophil Count EO# 0.10 *10^9/L 0.02-0.52 ;21 Absolute Basophil Count BASO# 0.02 *10^9/L 0-0.06 ;22 Plateletcrit PCT 0.06 ↓ % 0.17-0.35 ;23 C-reactive Protein CRP 2.86 mg/L 0-4.00 ;",
        "abnormal": "1. Red Blood Cells (RBC) 3.0*10^12/L ↓; 2. Hemoglobin (HGB) 97g/L ↓; 3. Hematocrit (HCT) 27.9% ↓; 4. Platelet count (Impedance method) (PLT-I) 47*10^9/L ↓; 5. Mean Platelet Volume (MPV) 13.2fL ↑; 6. Plateletcrit (PCT) 0.06% ↓."
      },
      "blood_biochemistry_test": {
        "result": "1 Alanine Aminotransferase ALT 42 U/L 9-50; 2 Aspartate Aminotransferase AST 60 ↑ U/L 15-40; 3 Glutamic Oxaloacetic Transaminase/Alanine Aminotransferase AST/ALT 1.43; 4 Total Protein TP 61.6 ↓ g/L 65.0-85.0; 5 Albumin ALB 31.7 ↓ g/L 40.0-55.0; 6 Globulin GLB 29.9 g/L 20.0-40.0; 7 Albumin/Globulin Ratio A/G 1.1 ↓ 1.2-2.4; 8 Total Bilirubin TBIL 41.5 ↑ μmol/L <23.0; 9 Direct Bilirubin DBIL 10.0 ↑ μmol/L <4.0; 10 Indirect Bilirubin IBIL 31.5 ↑ μmol/L <19.0; 11 Alkaline Phosphatase ALP 99 U/L 40-150; 12 γ-Glutamyltransferase GGT 31 U/L 10-60; 13 Prealbumin PA 93.5 ↓ mg/L 200.0-430.0; 14 Glucose GLU 4.66 mmol/L 3.90-6.10; 15 Urea Urea 5.05 mmol/L 3.10-8.00; 16 Creatinine Cr 89 μmol/L 57-97; 17 Uric Acid UA 287 μmol/L 208-428; 18 Calcium Ca 2.10 ↓ mmol/L 2.11-2.52; 19 Potassium K 4.71 mmol/L 3.50-5.30; 20 Sodium Na 136 ↓ mmol/L 137-147; 21 Chloride Cl 100.2 mmol/L 99.0-110.0; 22 Osmotic Pressure OSM 272 ↓ mOsm/kg 275-300; 23 Hemolysis HEM -; 24 Jaundice ICT +; 25 Lipemia LIP -;",
        "abnormal": "1. Aspartate Aminotransferase (AST) 60U/L ↑; 2. Total Protein (TP) 61.6g/L ↓; 3. Albumin (ALB) 31.7g/L ↓; 4. Albumin/Globulin Ratio (A/G) 1.1 (reference range: 1.2-2.4) ↓; 5. Total Bilirubin (TBIL) 41.5μmol/L ↑; 6. Direct Bilirubin (DBIL) 10.0μmol/L ↑; 7. Indirect Bilirubin (IBIL) 31.5μmol/L ↑; 8. Prealbumin (PA) 93.5mg/L ↓; 9. Calcium (Ca) 2.10mmol/L ↓; 10. Sodium (Na) 136mmol/L ↓; 11. Osmotic Pressure (OSM) 272mOsm/kg ↓."
      },
      "coagulation_function_test": {
        "result": "1 Prothrombin Time# PT# 20.8 ↑ S 9.4-12.5 ;2 Activated Partial Thromboplastin Time# APTT# 36.5 S 25.1-36.5 ;3 Thrombin Time# TT# 19.5 ↑ S 10.3-16.6 ;4 Fibrinogen# Fg# 1.1 ↓ g/L 2.0-4.0 ;5 Percent Activity PT% 43 ↓ % 70-130 ;6 International Normalized Ratio PT.INR 1.81 ↑ 0.85-1.25 ;",
        "abnormal": "1. Prothrombin Time# (PT#) 20.8S ↑; 2. Thrombin Time# (TT#) 19.5S ↑; 3. Fibrinogen# (Fg#) 1.1g/L ↓; 4. Percentage activity (PT%) 43% ↓; 5. International Normalized Ratio (PT.INR) 1.81 (reference range: 0.85-1.25) ↑."
      },
      "tumor_marker_test": {
        "result": "1 Carcinoembryonic Antigen (CEA) 1.8 ng/mL ≤5.0; 2 Alpha-Fetoprotein (AFP) 307.2 ↑ ng/mL ≤7.0; 3 Cancer Antigen 125 (CA125) 9.4 U/mL ≤35.0; 4 Cancer Antigen 19-9 (CA19-9) 69.9 ↑ U/mL ≤25.0; 5 Cancer Antigen 72-4 (CA72-4) <2 U/mL <10.0;",
        "abnormal": "1. Alpha-Fetoprotein (AFP) 307.2ng/mL ↑; 2. Carbohydrate Antigen 19-9 (CA19-9) 69.9U/mL ↑."
      }
    },
    "pathological_examination": "Not available.",
    "therapeutic_principle": "1. Based on the patient's condition, establish intravenous access, withhold food and water, and monitor vital signs. 2. For treatment, administer intravenous infusion of omeprazole and somatostatin to stop bleeding and protect the stomach from acid; ceftriaxone to prevent infection, and magnesium isoglycyrrhizinate to improve liver function abnormalities; regularly monitor complete blood count, and perform blood transfusion treatment when necessary; provide fluid replacement to maintain stability of electrolytes and acid-base balance, as well as nutritional support and other symptomatic treatments. 3. Complete routine admission tests such as electrocardiograms and cardiac echocardiography, determine surgical indications, rule out contraindications for surgery, and then schedule endoscopic surgery when appropriate."
}
```
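
The nested examination fields in the example can be flattened for inspection or prompt construction. The following is a minimal sketch that assumes a record is a Python `dict` shaped like the example above, with `imageological_examination` and `laboratory_examination` as objects keyed by examination name. The helper names and the toy record are hypothetical; the toy values are shortened from the example.

```python
def imaging_impressions(record: dict) -> dict[str, str]:
    """Map each imaging examination name to its whitespace-stripped impression."""
    exams = record.get("imageological_examination") or {}
    return {
        name: exam["impression"].strip()
        for name, exam in exams.items()
        if exam.get("impression")
    }


def abnormal_lab_findings(record: dict) -> dict[str, str]:
    """Map each laboratory test name to its `abnormal` summary, if present."""
    labs = record.get("laboratory_examination") or {}
    return {
        name: exam["abnormal"]
        for name, exam in labs.items()
        if exam.get("abnormal")
    }


# Toy record for illustration; values are shortened from the example above.
toy_record = {
    "imageological_examination": {
        "esophagogastroduodenoscopy": {
            "findings": "(shortened)",
            "impression": " (4) Portal hypertensive gastropathy.",
        }
    },
    "laboratory_examination": {
        "tumor_marker_test": {
            "result": "(shortened)",
            "abnormal": "1. Alpha-Fetoprotein (AFP) 307.2ng/mL ↑",
        }
    },
}

print(imaging_impressions(toy_record))
print(abnormal_lab_findings(toy_record))
```

Both helpers skip examinations whose target key is missing or empty, so a record without laboratory tests simply yields an empty dictionary.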

### Data Splits

N/A, this is an evaluation benchmark.

## Dataset Creation

The dataset was created between mid-2023 and early 2024 at Vaneval AI.

### Curation Rationale

Recent studies find that existing benchmarks cannot effectively evaluate the medical capabilities of LLMs. Firstly, existing benchmarks are often based on data collected from online consultation platforms or medical textbooks, which could easily be included in the training data of LLMs, leading to **data leakage or contamination** and thus biasing the performance evaluation. Secondly, departments in modern medicine are organized to address the complex medical needs arising from the different structures and functions of human organs, and the specific skills and treatment methods vary significantly across departments. However, existing evaluation benchmarks overlook the **multi-departmental and highly specialized nature** of modern medicine, so they cannot capture performance differences across departments. Thirdly, existing evaluation methods are typically confined to multiple-choice questions, which do **not align with real-world clinical diagnostic scenarios**. In actual medical environments, patients seek medical services because they are uncertain about their health conditions; they do not arrive knowing the possible disease options and ask a doctor to choose among them. Last but not least, there is currently no evaluation method that can comprehensively and reliably evaluate the **end-to-end practicality** of LLMs across the entire clinical diagnosis process, which starts when a patient enters the clinic and ends when the patient is discharged. This gap in turn limits the design and evaluation of practical medical agents powered by LLMs and hinders realizing the full potential of LLMs.

To address these limitations, we introduce ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for effectively and comprehensively evaluating the clinical diagnostic capabilities of LLMs.

### Source Data

#### Data Sources & Licenses

The data samples used in the ClinicalBench benchmark are sourced from real clinical medical records of officially certified Grade 3A hospitals in China (Grade 3A hospitals are the highest-level hospitals in China's "three-grade, six-class" classification system). The collection of this data strictly adheres to the principles of patient privacy protection. No information related to the hospitals is disclosed. As detailed in Data Processing & Quality, to protect patient privacy, any personally identifiable information (PII) of patients, treatment regions, or other sensitive information has been manually identified and removed by the team of doctors. All data is obtained legally and ethically, and has been reviewed and approved by the Ethics Committees of the relevant hospitals, ensuring that research activities on these data comply with ethical and legal obligations. Supporting documents and certification materials from notary institutions, which demonstrate the legality and ethicality of our data collection process, can be found in the supplementary materials.

We are committed to responsible data management and strictly follow relevant laws and regulations involving the collection, use, and distribution of protected health information. To ensure the legal and regulated use of the dataset, we have formulated the **ClinicalBench Usage and Data Distribution License Agreement**, which can be found in the supplementary materials. This agreement strictly requires all users to use the data solely for research purposes and to adhere to strict regulations protecting patient privacy, prohibiting any form of personal information tracking or identification. Through these measures, we ensure the legality and ethics of data acquisition and use while supporting research that may promote the development of LLMs in clinical diagnostics.

#### Who are the source language producers?

Participating hospital doctors.

### Data Processing & Quality

#### Annotation process

The ClinicalBench benchmark is manually created by three senior clinicians and two AI researchers. The creation process covers 4 key steps, as follows.

- The **Data collection** step focuses on authenticity, diversity, and privacy. Based on department divisions and common disease types in each department, the medical team selects representative real cases for each disease from the hospital case database, with permission for research use. Given that these clinical case data are the private information of hospitals, the risk of data leakage to any LLMs is completely eliminated.
- The **Professional knowledge review** step ensures the accuracy of the data. The team of doctors conducts a detailed professional review of the diagnostic information, treatment process, and results of each case to ensure the medical accuracy and proficiency of the data.
- The **Privacy protection and de-identification** step ensures privacy protection. To protect patient privacy, the team of doctors conducts two rounds of independent reviews to identify and remove any content that could reveal patient identities, treatment regions, or other sensitive information.
- The **Data integrity and compliance check** step aims for completeness and ethical compliance. Two AI researchers review the data to ensure that each record is complete and meets the medical task requirements set in Section 3.4 of the paper. Additionally, they reconfirm that the dataset does not contain any sensitive information and strictly complies with the ethical guidelines.

See the paper for more details.

#### Who are the annotators?

Three senior clinicians and two AI researchers. They are all authors of the paper.

### Personal and Sensitive Information

To protect patient privacy, any personally identifiable information (PII) of patients, treatment regions, or other sensitive information has been manually identified and removed by the team of doctors. For detailed processing methods, please refer to the explanation in the Annotation process section.

## Considerations for Using the Data

### Discussion of Biases

The data in ClinicalBench comes from mainland China and follows only the officially recommended diagnostic methods and procedures in mainland China. It may therefore not be representative of other regions and countries.

## Additional Information

### Dataset Curators

This dataset was initially curated by researchers from Vaneval.AI, Xi’an Jiaotong University, Xinxiang Medical University, Alibaba Group, University of Chinese Academy of Sciences, Beijing Tongren Hospital of Capital Medical University, and the School of Basic Medicine, Peking Union Medical College. The author list of the paper is the accurate list of specific contributors. The dataset is managed by Vaneval AI.

### Licensing Information

- ClinicalBench is released under the ClinicalBench Usage and Data Distribution License Agreement. Please be sure to comply with the terms of use.
- The remaining code is released under the Apache-2.0 license. Please be sure to comply with the terms of use.

### Citation Information

Update coming soon.
